A clustering approach to improve our understanding of the genetic and phenotypic complexity of chronic kidney disease

Chronic kidney disease (CKD) is a complex disorder that causes a gradual loss of kidney function, affecting approximately 9.1% of the world's population. Here, we use a soft-clustering algorithm to deconstruct its genetic heterogeneity. First, we selected 322 CKD-associated independent genetic variants from published genome-wide association studies (GWAS) and added association results for 229 traits from the GWAS catalog. We then applied nonnegative matrix factorization (NMF) to discover overlapping clusters of related traits and variants. We computed cluster-specific polygenic scores and validated each cluster with a phenome-wide association study (PheWAS) on the BioMe biobank (n = 31,701). NMF identified nine clusters that reflect different aspects of CKD, with the top-weighted traits signifying areas such as kidney function, type 2 diabetes (T2D), and body weight. For most clusters, the top-weighted traits were confirmed in the PheWAS analysis. Results were found to be more significant in the cross-ancestry analysis, although significant ancestry-specific associations were also identified. While all alleles were associated with a decreased kidney function, associations with CKD-related diseases (e.g., T2D) were found only for a smaller subset of variants and differed across genetic ancestry groups. Our findings leverage genetics to gain insights into the underlying biology of CKD and investigate population-specific associations.

Over the past few decades, many large-scale genomic studies, such as genome-wide association studies (GWAS), have successfully identified more than 500 independent genetic variants associated with reduced kidney function [10][11][12] .The association between genetic variants and various phenotypes has been studied, and the results are often shared in publicly available databases, like the GWAS Catalog 13 .The association of one genetic variant with multiple traits can be considered to identify secondary traits associated with a phenotype.This understanding can help elucidate potentially shared disease mechanisms, assuming that genetic variants affecting a shared pathway have a similar impact on the associated traits.
Soft-clustering methods provide a means to reduce the genetic complexity of a heterogeneous disease while also accounting for shared disease mechanisms.In contrast to hard-clustering approaches like K-means or hierarchical clustering, soft-clustering enables the factorization of high-dimensional data by identifying overlapping clusters 14 .Non-negative matrix factorization (NMF) is a family of algorithms within multivariate analysis that addresses the dimensionality challenge by extracting meaningful features from a given data set 15,16 .Using this approach, different genetically driven subtypes have been identified for Type 2 Diabetes in the past and were even associated with differences in the clinical outcomes [17][18][19][20] .
In this study, we aimed to deconstruct the heterogeneity of CKD by identifying its genetic subtypes.First, we collect all published variant-trait associations for variants associated with reduced kidney function and apply soft-clustering using NMF.We used the algorithm's weights to calculate cluster-specific polygenic scores (cPGS) within the BioMe biobank.Finally, we use a phenome-wide association study (cPGS-PheWAS) to validate and interpret the clusters.By deconstructing the complexity of CKD, this methodology contributes new insights into the disease pathways of CKD and enhances our understanding of population-specific differences for CKD.

NMF identified nine clusters of CKD-associated variants
We identified 508 independent genetic variants associated with decreased kidney function from the literature.Then, we retrieved all proxy SNPs in linkage disequilibrium with the identified variants and linked them to 805 associated traits using the GWAS Catalog database.After multiple filtering steps, including GWAS sample size thresholds and correction of GWAS p-values for multiple associations, we constructed a final matrix of traitvariant associations, which included 322 variants and 229 associated traits.We applied NMF to factorize the variants-traits association matrix X into a traits matrix W and variants matrix H with a shared number of clusters, k (Fig. 1).We identified nine clusters of CKD-associated variants using a hypothesis-free approach.
The most frequent CKD-associated secondary traits retrieved from the GWAS Catalog are related to kidney function (e.g., blood urea nitrogen, urea, uric acid, and cystatin C measurements), hemoglobin levels (e.g., hemoglobin measurements, hematocrit, and erythrocyte counts), T2D, body weight (e.g., body height, appendicular lean mass, BMI, BMI-adjusted waist-hip ratio), and pulse pressure (systolic and diastolic blood pressure measurements), among others (see Fig. S1).CKD-associated traits and their associated CKD variants were factorized into nine partly overlapping clusters by conducting NMF.To ensure the results were robust, we repeated the clustering with Bayesian NMF (bNMF) and got comparable results (Table S1).The top seven traits per cluster are summarised in Fig. 2. The 'Reduced lipids' cluster is associated with decreasing blood lipid levels (triglycerides, total cholesterol, use of lipid-lowering medications) and liver enzymes.The top traits of the cluster 'Increased body mass' show a positive association with body weight (appendicular lean mass, body height, and body weight).The clusters 'Increased blood volume' and 'Reduced blood volume' are positively and negatively associated with volemic traits (e.g., mean corpuscular volume and mean corpuscular hemoglobin), respectively.
Similarly, clusters 'Increased/Reduced hematocrit' show opposite associations with hemoglobin content (e.g., hematocrit, hemoglobin measurements, red blood cell density, erythrocyte count), and clusters 'Increased/ Reduced inflammation' convey opposite associations with markers of inflammation (C-reactive protein) and blood lipids.Lastly, cluster 'Increased urate' is positively associated with kidney function biomarkers like urate, blood/serum urea nitrogen, blood proteins, and Cystatin C. The complete lists of the top features and variants per cluster, defined as traits and variants in the top decile of the cluster weights of the matrices H and W, are listed in Table S2.The matrices H and W are also available as supplementary material.Figure S2 summarises how the variants are distributed in each cluster, showing their overlaps.
Comparing three different pathway analysis approaches, we could identify significantly enriched pathways (q-value < 0.05) based on Ingenuity Pathway Analysis (IPA) for four of the nine identified clusters (Fig. S3).Overall, only two of the 22 identified enriched clusters were enriched in more than one of the clusters, namely, the myelination signaling pathway (enriched for the 'Increased hematocrit' and 'Reduced urate' clusters) and the estrogen receptor signaling pathway (enriched for the 'Reduced hematocrit' and 'Reduced urate' clusters).For some of the enriched pathways, we could identify an association with the top-weighted traits of the corresponding clusters, suggesting their biological plausibility.For instance, the IL-12 Signaling and Production in Macrophages is significantly enriched (q-value = 0.02) for the top-weighted genes of the increased inflammation cluster.IL-12 is a pro-inflammatory cytokine that, in the past, has been associated with multiple immune-mediated diseases 21 .For the 'Increased hematocrit' cluster, characterized by increased hematocrit, hemoglobin, red blood cell density measures, and erythrocyte count, BMP signaling was one of the significantly enriched pathways (q-value = 0.03).Bone Morphogenic Protein (BMP) has been implicated with hematopoiesis 22 .The top associated genes of the 'Increased urate cluster' suggest alterations of the metabolomic and renal functions.These suggestions can be validated by the Xenobiotic Metabolism AHR Signaling pathway, which can be activated in CKD patients due to the accumulation of uremic toxins and can promote renal fibrosis 23 .

PheWAS validated biological pathways of most clusters across ancestries
Each cluster was examined by conducting a cPGS-PheWAS with 988 quantitative traits and 832 binary traits on the four BioMe cohorts (ALL, AFR, AMR, EUR).Except for the 'Increased body mass' (abbreviated as 'IBM'), 'Reduced lipids' , and the 'Reduced inflammation' clusters, we validated at least 2 of the clusters' top 5 traits (Fig. 3).The level of significance was reached more frequently across ancestries (ALL) than when validating on the individual ones (AMR, AFR, EUR) (Table S3).In addition to the validated traits, significant associations with decreasing eGFR were seen in clusters 'Increased urate' (β = − 0.04 [− 0.06 to − 0.03], p-value = 6.7e−09) and 'Reduced  S3).'Reduced hematocrit' was also nominally associated with an increased risk for chronic renal failure (OR = 1.S3. www.nature.com/scientificreports/cPGSs suggest apparent differences between genetic ancestries We extracted the cluster weights of the W matrix and used them to calculate cluster-specific polygenic scores (cPGS) for participants of the BioMe cohort.Figure 5 shows the standardized polygenic score distributions for all NMF clusters across the BioMe cohort (ALL) and the individual continental populations EUR (n = 7447), AMR (n = 5336), and AFR (n = 5660).A normal distribution was observed for the cluster 'Increased urate' (EUR, AMR, and AFR; Anderson-Darling test, all p-values available in Table S5).Although polygenic scores are expected to have a normal distribution 24 , the other eight clusters present either a skewed tail (e.g., 'Increased hematocrit') or several peaks in their cPGS distributions (e.g., 'Reduced inflammation').As illustrated in Fig. 4, the peaks are caused by a few variants with relatively high cluster weights (the complete list of cluster weights for the top variants of each cluster is available in Table S2).For example, the top variant in cluster 'Increased inflammation' (rs429358, mapped gene: APOE) weighs 4.6, while the second one (rs17050272) weighs 0.2.In Fig. 5, we can also observe how this variant is more frequent in participants of inferred EUR ancestry.Similarly, the top variant of 'Reduced inflammation' (rs1260326, mapped gene: GCKR) weighs 5.7 and seems more frequent in the AFR population, while the second one (rs4418728) weighs 0.9.This unbalance in weight creates the three peaks of the distributions: the lower peak includes the scores of individuals without the top variant (0 copies), the middle one the heterozygous (1 copy), and the higher peak includes scores of participants with two copies of the top variant.
Other ancestry-specific differences are visible in the distributions of four clusters and are significant when testing with the Mann-Whitney test (all p-values available in Table S4).This suggests that some variants appear with different frequencies in people that do not share similar ancestry: 'Increased inflammation' (all combinations), 'Reduced inflammation' (all combinations), 'Reduced lipids' (EUR vs. AFR), and 'Increased body mass' (EUR vs AFR and AFR vs AMR).www.nature.com/scientificreports/

Distribution of participants across clusters
As the cPGS are calculated separately per cluster, each BioMe participant might have high polygenic scores in multiple clusters.Therefore, to understand the cluster overlap in terms of relative risk, we checked how many individuals belonged to the top decile of 1 or more clusters.58% (18,431/31,701) of the whole BioMe cohort (ALL) were at high risk in at least 1 cluster.Of these, 60.2% were in the top decile for only 1 cluster, while 37% were at risk for 2-3 clusters (Fig. S5).

Discussion
CKD is typically defined as a progressive loss of kidney function over time.Although numerous genetic variants have been identified as associated with CKD, their relationship to disease pathways remains largely unclear.The work described here is the most comprehensive assessment of how variants associated with CKD can be grouped according to different CKD-related factors.Specifically, we included variant-trait associations of 322 CKD SNPs and 229 related metabolic traits from publicly available GWAS datasets.By analyzing these associations with NMF, a factorization approach that allows for minimal overlap between groups, we identified 9 clusters of CKD variants and associated traits.CKD is commonly recognized as a heterogeneous condition with various underlying causes and risk factors, which are unlikely to represent a single disease process.This complexity is also reflected by the associated traits retrieved from published GWAS, which are related to kidney function, hemoglobin levels, T2D, body weight, and pulse pressure, among others.Attempting to deconvolute CKD's genetic heterogeneity and differentially grouping these traits, the nine clusters we identified represented different aspects of CKD.For example, the 'Increased urate' cluster, whose clustering weights represent abnormal levels of urinary metabolites like urate, blood/serum urea nitrogen, blood proteins, and Cystatin C, is related to decreasing kidney function.In normal conditions, such blood metabolites are excreted by the kidneys, but in CKD, they accumulate and exert a detrimental biological activity 25,26 .A second cluster, which we summarised as 'Increased inflammation, ' was strongly clustered around rising serum C-reactive protein (CRP) concentrations.CRP is a common inflammatory biomarker in chronic diseases like CKD, diabetes, and cardiovascular diseases [27][28][29] .In line with that, patients with CKD commonly experience chronic inflammatory states 30 .These states tend to worsen as the disease progresses toward end-stage renal disease and are reflected, or even modulated 31 , by increasing CRP levels [32][33][34] .For these two pathways, we identified significantly enriched pathways using the top 25% weighted genes in IPA that are in line with the described top-weighted traits.Overall, the difference in significantly enriched pathways between the clusters suggests biological differences between them.
We then studied the genotype-phenotype correlation to demonstrate the utility of the clusters.We could validate most of the top-weighted features on quantitative traits (i.e., biomarkers), while the validation on binary traits (i.e., diagnoses) was less robust and required additional clinical interpretation.For the clusters of 'Increased urate' and 'Increased inflammation, ' the top traits were confirmed by the PheWAS.CKD is also associated with dyslipidemia, which is comprised of high levels of triglycerides and LDL-cholesterol, and low levels of HDLcholesterol and apolipoprotein A1 35 .We could observe similar associations in clusters 'Increased inflammation, ' 'Reduced lipids,' and 'Reduced inflammation.'Notably, we found multiple significant associations for cluster 'Increased inflammation' with reduced risk of dementias.Glycerophospholipids play an essential role in neural membranes 36,37 , and their levels are directly correlated with serum triglycerides and inversely correlated with total cholesterol and eGFR 38 .
A limitation of this study is the need for more genetic diversity in the GWAS Catalog, which mainly consists of studies performed on the European population.This European bias is well described in the literature and has important implications for disease risk prediction across global populations 39 .Despite this lack of genetic diversity, which affected the initial selection of variants used in the input matrix, we could still validate our results in BioMe, a biobank enriched for populations with non-European ancestries.We were most powered when jointly analyzing across ancestries (ALL), while signals validated in different ancestral groups with some group-specific differences.This result suggests that ancestry-specific studies are essential although most CKD risk factors converge across ancestral groups.Another two limitations are the filtering rules used to select traits and variants for the algorithm's input matrix and the possible existence of non-additive interactions between risk factors that we did not consider in this study.Lastly, one of the input CKD studies, the PAGE study 40 , was also conducted using BioMe data.However, this should not impact the results since we are not looking at CKD case/control scenarios but at CKD subtypes.
Understanding the biological pathways that lead to CKD is essential to improve clinical management.For example, some clusters group similar traits but with opposite effect directions (e.g., 'Increased hematocrit' and 'Reduced hematocrit'), while others suggest potentially protective effects (e.g., against dyslipidemia in cluster 'Increased inflammation').This behavior might indicate that CKD can affect the same metabolic pathways differently, confirming the genetic complexity of the disease.Additionally, the clusters have a limited degree of overlap, and as each represents a specific set of variants, participants might be high risk (i.e., in the top decile of the polygenic score) for more than one cluster.This additive disease model, similar to the mutational signatures in cancer, suggests a possible interplay of genetic susceptibility to multiple disease-causing mechanisms 41 .
In summary, by clustering genetic variants associated with CKD, we identified clusters with distinct trait associations, likely representing mechanistic pathways involved in CKD.We confirmed the validity of these clusters phenotypically.Further clinical investigations could explore whether individuals with a common disrupted pathway also share similar complications, a comparable rate of disease progression, or a different treatment response.In the future, classifying patients with CKD using their genotype may improve care by offering a more personalized and genetically informed clinical plan.

Trait-variants selection
We identified and aligned the alleles of 508 independent genetic variants associated either with decreased kidney function (defined as low eGFR levels for at least three months) or with CKD (using ICD-9/10 codes) from the most recent GWAS and GWAS meta-analyses 10-12, 40, 42, 43 (Fig. 1a).We then used the R package LDlinkR (R version 4.2.1) to retrieve all proxy SNPs in linkage disequilibrium (r 2 > = 0.6) with the lead variants, across all available 1000G human populations 44,45 .We used the GWAS Catalog database to link the proxy SNPs to 805 associated traits (as of July 30th, 2022) 13 .We excluded gender-specific GWAS and GWAS performed on less than 100 individuals.Additionally, as we are interested in secondary features associated with CKD, we excluded GWAS of traits directly related to eGFR or CKD (e.g., "Mild to moderate chronic kidney disease, " "Estimated Glomerular Filtration Rate").We kept trait-variant associations with a significance threshold of less than 1 × 10 −6 using a Bonferroni correction for all 2401 associations in our data set.To reduce sparsity in the data, we excluded traits associated with less than five variants; this threshold was empirically defined by comparing the clustering results of traits associated with up to 15 CKD variants.We standardized effect sizes across all GWAS by dividing the regression coefficient beta (B) by the standard error, using the GWAS summary statistic results.Traits and variants were then arranged as a matrix with the standardized effect sizes (β) as values.Table S5 contains, for each input CKD variant, the list of CKD-associated secondary traits extracted from the GWAS Catalog and the corresponding exclusion criteria for those excluded during the filtering steps.

NMF
NMF factorizes the input matrix of trait-variant associations (X, of dimensions 229 × 322) into a matrix of traits (H, 229xK) and one of variants (W, Kx322), so that HxW ≈ X 15 (Fig. 1b).The factorization rank K corresponds to the number of clusters.We implemented NMF using the R package ButchR with 10,000 iterations, 30 random initiations, and the convolution threshold set to 80 46 .The number of expected clusters was set between 2 and 20.ButchR suggests the optimal K based on six cluster evaluation metrics, like the mean silhouette width and the Frobenius error.If two or more K were presented, we considered results with the highest mean silhouette width and the lowest Frobenius error, as suggested by Alexandrov et al. 41 .As additional validation, we also performed a Bayesian version of NMF 16 , using the code provided by Udler et al. 17 .bNMF was run 1000 times with up to 200,000 iterations in each run.

Pathway analyses
To identify cluster-specific enriched pathways, we conducted three different pathway analysis approaches using the closest genes to the variants (n = 262).For gene set enrichment analysis (GSEA), using the clusterProfiler R package 47 , we tested for enrichment of the Hallmark, Ractome and KEGG [48][49][50][51] gene pathways from MSigDB.Ranks of the genes were defined based on the weights from the variant-cluster matrix H.If a gene was annotated to multiple variants, we only considered the highest weights.Using the top 25% weighted SNPs and their corresponding annotated genes, we further conducted Ingenuity Pathway Analysis 52 and Overrepresentation Analysis using the WEB-based GEne SeT AnaLysis (WebGestalt) toolkit 53 , focusing on the KEGG, Reactome, Panther, and Wiki pathways.

Cluster-specific polygenic scores
The results of clustering provide cluster-specific weights for each variant and trait.We used PLINK and the variant cluster weights to calculate cluster-specific polygenic scores (cPGS) of the BioMe biobank participants 54 .cPGS were standardized within each cluster.The normality of each cPGS distribution was tested with the Anderson-Darling method.Differences between ancestry-specific distributions were tested with the Mann-Whitney test.

Validation cohort (BioMe)
We validated our results using the genetic and linked electronic health records (EHR) data of 31,701 BioMe biobank participants 55 (Fig. 1c).As a fine-scale population structure can improve the risk prediction of complex diseases within genetic groups 56 , we inferred the genetic ancestry of the BioMe participants.We then performed a Principal Component Analysis (PCA) using PLINK, excluding relatives above 2nd-degree (kinship method, estimated using KING 57 ) and variants with minor allele frequency below 0.05 54,58 .We trained a random forest classifier to infer the genetic ancestry of BioMe participants using the 1000 Genomes labels as reference 59 .The labeled ancestries are Admixed American (AMR, n = 5336), African (AFR, n = 5660), European (EUR, n = 7447), South Asian (SAS, n = 613), and East Asian (EAS, n = 728).(n = 11,404).

Modeling disease outcomes as a function of cluster-specific polygenic scores
For each cluster, the cPGS were associated with the phenotypes available in the BioMe data set by performing a phenome-wide association study (cPGS-PheWAS).We fitted linear regression models to analyze 988 quantitative traits (e.g., laboratory results) and logistic regression models for 832 binary traits with cPGS as independent variables, adjusting for sex, age, and the first ten genetic principal components (stats R package 61 ).Binary traits included Phecodes mapped to ICD-9 and ICD-10 codes (a Phecode is considered if at least two relevant diagnostic codes were present in a patient's EHR) 62 and curated phenotypes 63 .Controls were identified as the reference category.Traits were only considered if present or measured in at least 100 biobank participants.The model

Figure 1 .
Figure 1.Methods overview.(a)We selected 508,322 independent CKD-associated variants from the summary statistics of published GWAS.For each of them, we retrieved all independent proxy SNPs in linkage disequilibrium (r 2 > = 0.6) and (from the GWAS Catalog) 805,229 proxy-associated traits with their respective effect size (B).(b) After multiple filtering steps, Wwe standardised the effect sizes across all GWAS (β) and generated an association matrix X of dimensions 229 × 322.NMF factorizes X into a matrix of traits (W) and one of variants (H), which share a dimension K (i.e., the number of clusters).(c) We extracted the weights of each cluster from the H matrix and used them to calculate cluster-specific polygenic scores (cPGS) of 31,701 BioMe participants.After standardizing the cPGS, we conducted a cPGS-PheWAS for each cluster to validate their respective top traits, which were extracted from the W matrix.

Figure 2 .Figure 3 .
Figure 2. Top seven CKD-associated secondary traits per cluster (also available as LaTeX code).The top seven secondary traits per cluster are shown with their effect direction (Effect columns) and respective cluster weights (Weight columns).'HDL' is high-density lipoprotein, 'VLDL' is very low-density lipoprotein, 'meas.' is measurement, and 'conc.' is concentration.

Figure 4 .
Figure 4. Summary statistics of the weights of each cluster (also available as LaTeX code).'SNPs' indicates the number of CKD variants with a weight > 0. The minimum weight in all clusters is 1e-45.'Q 90 weight' is the minimum weight of the SNPs in the cluster's top decile.

Figure 5 .
Figure 5. Standardized cluster-specific polygenic scores (cPGS) per genetic population.The figure compares the standardized cPGS distributions between inferred ancestries of the BioMe participants.The x-axis represents the units of standard deviation (or z-scores).AFR, AMR, and EUR refer to the sub-cohorts of individuals with inferred African, Ad Mixed American, and European ancestry, respectively. )